PRINCIPLES OF WORK SAMPLE TESTING: I. A NON-EMPIRICAL TAXONOMY OF
TEST USES

BRIEF 


Because  classical  psychometric  theory  often  seems  inadequate  for 
the development and evaluation of work sample tests, and because recent
challenges  to  classical  theory  have  had  promising  implications,  the 
conceptual  foundations  of  work  sample  testing  need  to  be  examined  and 
clarified.  This  report,  the  first  in  a series  of  four,  attempts  to 
provide  a background  for  that  examination  by  considering  the  full  scope 
of  measurement  in  psychology.  The  purpose  is  to  determine  whether 
different kinds of measurement, or different circumstances of measurement,
have different implications for the development and evaluation
of  measurement  procedures. 

The  most  fundamental  approach  to  measurement  is  mathematically 
formal;  it  conforms  to  certain  mathematically  stated  axioms,  principally 
the axiom of transitivity. Fundamental measurement is expressed in formally
defined units which are widely accepted throughout the scientific
community. The use of such formal measurement provides rather direct
descriptions,  with  little  or  no  need  for  inferences,  of  the  attributes 
of  objects  being  measured.  An  example  of  such  measurement  is  linear 
distance.  One  does  not  speak  of  "inferring"  the  length  of  an  object 
through  measurement,  although  it  would  be  true,  because  the  inference 
and  the  fact  of  the  measurement  are  very  nearly  the  same. 

Most  measurement  in  psychological  research,  and  particularly  the 
measurement  described  by  classical  psychometric  theory,  provides  only 
signs  from  which  inferences  are  drawn  about  the  attributes  of  interest. 
The  unit  of  measurement  is  typically  the  standard  deviation  of  the 
distribution of a set of measurements, not a mathematically defined
formal unit; traditional psychometric measurement is said, therefore,
to be "norm-referenced." That is, the meaning of a score is defined
relative to its position within the distribution of scores; in contrast,
fundamental measurement can be applied to the single case,
defining  the  meaning  of  a "score"  in  terms  of  the  units  of  measurement 
in  the  scale  used. 

Three  challenges  to  classical  psychometric  theory  have  gained  in 
attention  in  recent  years.  One  of  these  is  a trend  toward  greater 
preference  for  content-referenced  measurement  as  distinguished  from 
norm-referenced  measurement.  Another  is  latent  trait  theory,  which 
provides  an  analog,  at  least,  to  a mathematically  formal  unit  of 
measurement. The third is generalizability theory, which seeks a more
precise understanding of the errors of measurement. All of these
challenges seem to have special significance for work sample testing.

To provide a framework within which to consider classical psychometric
theory and these challenges, a tentative taxonomy of psychological
measurement is proposed. With it, the special issues in work
sample testing can be viewed in the larger context of measurement in
psychology generally. Four specific taxonomies are proposed:
classifications of (a) the purposes of measurement, (b) settings in which
measurements are obtained, (c) variables or attributes to be measured,
and (d) the methods of measurement in psychology.

Six  broad  purposes  of  measurement  are  identified: 

1.  Evaluation  of  materiel,  processes,  or  programs  to  permit 
organizational  decisions  to  be  made  about  them. 

2.  Organizational trouble shooting to identify needs for corrective
actions concerning personnel units.

3.  Individual  diagnosis  identifying  strengths  and  weaknesses 
of  individuals,  either  internally  or  relative  to  others. 

4.  Certification  of  individual  proficiency  or  need,  or  levels  of 
these,  such  as  in  the  skill  qualification  testing  program. 

5.  Prediction of future performance or characteristics of individuals,
such as prediction for selection decisions.

6.  Evaluation  of  other  measurements,  such  as  the  use  of  one 
measurement  as  a criterion  in  the  validation  of  another  one. 

Three types of measurement settings are defined. Types of variables
are presented under two subheadings, attributes of people and
attributes of tasks. Seven categories of the personal attributes are
listed in decreasing order of objectivity of measurement, and a similar
order is tentatively proposed for nine categories of task variables.
Five  kinds  of  measurement  methods  are  identified,  again  in  decreasing 
order  of  probable  objectivity  in  measurement,  ranging  from  the  use  of 
special  instrumentation  to  the  use  of  ratings. 

Most purposes of measurement require, at least for the evaluation
of measurements, the potential for substantial variance; the
argument that mastery testing, for example, should have low variance is
rejected. Regardless of purpose, some form of generalizability is
needed, although the diagnostic and certification purposes emphasize
the generalizability of scores while prediction requires generalizability
of relationships. Regarding the categories of settings, the
same statement is appropriate: generalizability across settings,
either of scores or of relationships, seems universally necessary.

The joint classification of variables and of the methods for measuring
them has more diverse implications. For the most highly objective
combinations, measurement must be accurate and interpretable in
relation to a standard. Since work sample tests strive for objectivity,
the same implications exist for them. The more subjective combinations
require research into the acceptability of possible inferences as the
principal form of evaluation.

INTRODUCTION

The well-established technology for aptitude testing seems inadequate
for some purposes, including certification testing by work samples.
In recent years, challenges to classical psychometric theory
have come from many sources (e.g., Cronbach, Gleser, Nanda, & Rajaratnam,
1972; Lord, 1952; Muir, 1977; Popham & Husek, 1969). This report, and
the three that follow, will consider both classical theory and its
challenges in examining (a) some special problems of work sample testing,
(b) some relatively new developments in measurement, and (c)
some old measurement ideas that are often ignored in psychometric
discussions. By examining and clarifying the conceptual foundations of
work sample testing, these papers will offer principles for the
construction, use, interpretation, and evaluation of work sample tests
in the broader context of general problems in the measurement of
psychological variables.

The present report will identify the place of work sample testing
in the context of a non-empirical taxonomy of general psychological
measurement. The taxonomy will be described, and its implications for
test evaluation will be presented with special emphasis on work sample
testing. The second paper looks broadly at the scope of systems for
evaluation of personnel testing programs. Evaluation includes psychometric
concepts of validity, but it is not restricted to them.

With these broad perspectives as context, the third paper will
focus explicitly on the construction and validation of work sample
tests. Since the principal requirement to satisfy in work sample testing
is generalizability of scores, the final paper in the series will
be concerned explicitly with the problems and opportunities of different
kinds of generalizability research for work samples and work
sample validities.

A SYNOPSIS OF MEASUREMENT THEORY

Measurement  is  a characteristic  scientific  endeavor.  No  field  of 
scientific  enterprise  can  progress  far  without  operationally  defining, 
classifying,  and  quantifying  its  variables.  Applied  science  relies 
especially  heavily  on  the  quantification  of  its  subject  matter. 
Measurement is not unique to psychological research, nor is preoccupation
with an underlying theory of measurement a special prerogative of
psychometrics.

Fundamental  to  any  discussion  of  measurement  is  the  fact  that  one 
does  not  measure  objects  or  people;  rather,  one  measures  attributes  of 
objects  or  people.  Measurement  implies  the  assignment  of  numbers  to 
represent  attributes  according  to  some  specified  set  of  rules.  Systems 
for  assigning  numbers  can  be  devised  for  representing  the  weight  of 
objects, the amount of information in a message, the amount of perceptual
skill characterizing an individual, or the quality of an individual's
performance. An acceptable system of measurement assigns
numbers  to  represent  only  one  attribute;  other  numbers,  assigned 
according  to  other  rules,  can  represent  other  attributes  of  the  same 
objects,  messages,  people,  or  performance. 

KINDS  OF  MEASUREMENT 

It  is  useful  to  distinguish  different  kinds  of  measurement.  Some 
approaches  to  measurement  are  so  constructed  that  the  numerical  result 
in  measuring  an  attribute  of  something  is  understood  primarily  with 
reference  to  the  measurement  system  itself.  In  such  measurement  systems, 
the rules for assigning numbers to represent quantities are so definite,
unambiguous, and widely accepted that an obtained number has an immediate
and obvious descriptive meaning. Many of these intrinsically
obvious measurement processes have a foundation in clearly established
natural law. Torgerson (1958) referred to such measurement as fundamental.
The best examples of such fundamental measurement are physical
measurements  such  as  counting  objects,  weighing  objects,  measuring 
distances,  and  the  like. 

He  also  described  derived  measures,  those  which  are  derived  from 
fundamental  measurement  with  a similar  kind  of  internally  consistent 
meaning.  Examples  include  more  complex  kinds  of  physical  measurement, 
such  as  the  measurement  of  density  as  a ratio  of  mass  to  volume.  While 
these may not take their meaning in a wholly internal way, as in more
nearly fundamental measurements, they take their meaning in the
relationships of established scientific law relating an attribute to other
attributes. 

Both  kinds  of  measurement  described  above  are  mathematically 
formal  systems;  that  is,  they  conform  to  certain  basic  mathematical 
axioms  such  as  those  of  transitivity  or  additivity,  and  that  conformity 
can  be  demonstrated  through  formal  mathematical  proofs. 

Although  the  most  obvious  examples  of  mathematically  formal 
measurement  are  physical  measurements,  psychology  is  not  without  such 
formal systems of its own. Quite apart from the obvious behavior
frequency counts (which are, of course, physical measurements of rate
of occurrence), psychology has specialized fields of mathematical
measurement theory such as information theory and signal detection
theory. These approach measurement formally with neither interest in
nor need for the conventional psychometric theory developed for traditional
mental testing.

In contrast, other measurement derives meaning inferentially
more than directly descriptively. For example, formal physical measurements
may be used to describe directly an attribute from which some
other attribute is inferred; we speak (perhaps erroneously) of having
"measured" the inferred attribute. An excellent example is the galvanic
skin response; literally, one measures electrical resistance on the
surface  of  the  skin,  but  changes  in  that  resistance  are  used  for 
inferring  changes  in  emotionality,  and  GSR  is  said  to  be  a measure  of 
emotion. 

Much  of  psychological  measurement  is  derived  measurement,  but  it 
is statistically derived. It is not derived from statements of invariant
lawfulness, as in the measurement of density; it is formally
derived from statistical analyses and assumptions. The best examples
of statistically derived measurement in psychology are those stemming
from research in psychophysics, such as using Thurstone's Law of
Comparative Judgment (Thurstone, 1959). The early history of mental
testing proceeded in an analogous way; each item in a test was treated
as a stimulus item, the response to which had some probability of providing
an appropriate inference. The probability of an appropriate
inference was increased by repeated stimulation, i.e., by using
several items to make up a total measure or score. Modern computerized
adaptive or tailored testing is a further example of statistically
derived measurement, differing from earlier testing more in
mathematical  sophistication  than  in  principle. 

Statistically  derived  measurement  is  no  less  formal,  and  no 
less  rigorous,  than  mathematically  formal  measurement  derived  from 
fundamental  measurements.  Most  statistically  derived  psychological 
measurement has its own unique "mental unit of measurement" (Thurstone,
1959, p. 50). Whether it is the discriminal dispersion of judgments
of scale separation, the variance in a set of test scores, or a
hypothetical scale for measuring latent ability, the unit of measurement
in most mental measurement is the standard deviation.

Inferences may also be drawn from less formally developed measuring
instruments. A fourth category of the kinds of measurement includes
what can best be described as intuitive measurements. Many index numbers
are established by intuitively combining a host of considerations;
ad hoc tests may be constructed without prior statistical analysis
but with some degree of rational thought; perhaps the best example of
intuitive  measurement  is  the  ubiquitous  five-point  rating  scale  which 
is  applied  willy-nilly,  without  any  formalisms  or  supporting  data. 
Intuitive  measures  can  be  highly  useful.  Much  of  economic  theory  has 
been  developed  using  such  index  numbers.  As  research  progresses  with 
such  measurement  schemes,  lawful  relationships  are  often  identified 
which permit the development of more formal approaches to the measurement
of the same variables.

MEASUREMENT  OF  WORK  SAMPLE  PERFORMANCE 


Work sample testing may use all of the kinds of measurement in
measuring attributes of either the work process or of the product
(Shimberg, Esser, & Kruger, 1972). Intuitive scales may be used to
rate or evaluate the process. Performance might be scored as paper-and-pencil
tests are scored (which some forms of work sample tests
actually are), using the theoretical foundations and principles for
selecting  items  and  evaluating  scores  used  in  traditional  test 
construction  and  evaluation.  Fundamental  measures  may  be  used  to 
describe  the  product  or  result  of  performance;  quality  of  performance 
can  be  inferred  by  weighing,  by  determining  a physical  breaking 
point,  measuring  conformity  to  tolerances,  or  by  using  other  forms  of 
fundamental,  physical  measurement  of  chosen  attributes  of  a physical 
product.

Classical psychometric theory, which is but one theory among
many, does not traditionally apply to, and may be inadequate for,
some kinds of work sample measurement. Much mischief and confusion
can result from misguided attempts to squeeze work sample testing
into the same rubric used for the evaluation of inferences drawn from
aptitude tests, even though many work sample variables can be appropriately
handled within a conventional psychometric theory.

In  considering  alternatives  for  the  evaluation  of  performance  on 
work  samples,  it  is  instructive  to  consider  challenges  to  traditional 
theory  that  have  been  offered  in  recent  years.  Perhaps  the  most 
active field of challenge is that known as content-referenced measurement,
among other names, with its insistence that performance be
measured not in terms of standard deviations from a sample or population
mean but in terms of reaching or deviating from a specified standard
level of performance (Glaser & Klaus, 1962).

Another  emerging  challenge  to  traditional  psychometric  theory 
comes from latent trait theory (Lord, 1952), or latent structure analysis
(Lazarsfeld, 1950), which attempts to identify item characteristics
as essentially sample-free estimates of item parameters instead
of  item  statistics  based  on  the  sample  at  hand.  Characteristics  of  a 
test  can  then  be  defined  in  terms  of  the  characteristics  of  independent 
items  comprising  the  test. 
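
As a rough illustration of how test characteristics can follow from
item characteristics, the sketch below assumes a two-parameter
normal-ogive item characteristic function. The parameter values and the
names item_probability and expected_test_score are hypothetical and are
not taken from the sources cited. Under the usual assumption that items
are locally independent, the expected number-correct score at a given
ability level is the sum of the item probabilities.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def item_probability(theta, discrimination, difficulty):
    """Probability of a correct response at ability theta (normal-ogive form)."""
    return _STD_NORMAL.cdf(discrimination * (theta - difficulty))


def expected_test_score(theta, items):
    """Expected number-correct score: the sum of the item probabilities."""
    return sum(item_probability(theta, a, b) for a, b in items)


# Hypothetical (discrimination, difficulty) pairs for a three-item test.
ITEMS = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]

# Expected scores at low, middle, and high ability levels.
curve = [expected_test_score(theta, ITEMS) for theta in (-2.0, 0.0, 2.0)]
```

Because the item parameters are treated as properties of the items
rather than of the group tested, the same function describes the test
for any sample of examinees, at least to the extent the model holds.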

A third challenge comes from generalizability theory (Cronbach
et al., 1972), which questions the adequacy of the traditional true
score and error score division of obtained scores; it works instead to
allocate the portions of total obtained score performance among various
facets or conditions of measurement. In short, generalizability
theory argues that it is the consistency or dependability of
measurement over varying conditions that is the important point in
the evaluation of measurement.
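
A minimal sketch of this allocation, assuming the simplest crossed
design (every person scored by every rater, one score per cell): the
observed variance is partitioned into components for persons, raters,
and a residual, using the expected mean squares of a two-way analysis
of variance. The data, the function names, and the choice of raters as
the single facet are illustrative assumptions, not part of the original
report.

```python
def variance_components(scores):
    """Estimate variance components for a persons-by-raters table.

    scores[p][r] is the score rater r gave person p (one score per cell).
    Negative estimates, which can arise from sampling error, are set to zero.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for the two main effects and the residual.
    ss_person = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_rater = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_person - ss_rater

    ms_person = ss_person / (n_p - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_residual = max(ss_residual, 0.0) / ((n_p - 1) * (n_r - 1))

    return {
        "person": max((ms_person - ms_residual) / n_r, 0.0),
        "rater": max((ms_rater - ms_residual) / n_p, 0.0),
        "residual": ms_residual,
    }


def generalizability_coefficient(components, n_raters):
    """Relative coefficient for a score averaged over n_raters raters."""
    error = components["residual"] / n_raters
    total = components["person"] + error
    return components["person"] / total if total > 0 else 0.0


# Hypothetical work sample ratings: four examinees, each scored by three raters.
ratings = [
    [7, 8, 6],
    [5, 6, 5],
    [9, 9, 8],
    [4, 6, 3],
]
components = variance_components(ratings)
g_one_rater = generalizability_coefficient(components, 1)
g_three_raters = generalizability_coefficient(components, 3)
```

Comparing the coefficient for one rater with that for three shows how
dependability changes with the conditions of measurement, which is the
question generalizability theory puts in place of a single true-score
and error-score split.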

These challenges are all relevant to the development and evaluation
of work sample tests. For example, if a particular work sample
is  devised  for  a welder,  and  if  all  of  the  people  who  are  administered 
the  test  perform  poorly  on  it,  there  is  little  benefit  to  be  derived 
from  identifying  certain  people  as  having  performed  better  than  others; 
the  significant  statement  is  the  content-referenced  interpretation 
that  they  all  performed  below  standard.  Since  work  is  rarely  conducted 
under  well-controlled,  standard  conditions,  the  stability  of  work 
sample performance across a reasonable range of circumstances is certainly
important. The applications of latent trait theory are perhaps
less  obvious;  it  is  sufficient  here  to  note  that  such  applications 
can  provide  a basis  for  standardizing  interpretations  of  content- 
referenced  tests  over  different  samples  of  people  tested  in  different 
locations  or  at  different  times. 
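
The contrast in the welder example can be made concrete. The sketch
below scores the same hypothetical results two ways: in
standard-deviation units from the group mean (norm-referenced) and as
deviations from a fixed standard (content-referenced). The scores, the
standard of 70, and the function names are assumptions made for
illustration.

```python
from statistics import mean, pstdev


def norm_referenced(scores):
    """Express each score in standard-deviation units from the group mean."""
    center, spread = mean(scores), pstdev(scores)
    if spread == 0:
        # No variance in the group: every score sits at the mean.
        return [0.0 for _ in scores]
    return [(x - center) / spread for x in scores]


def content_referenced(scores, standard):
    """Express each score as its deviation from a fixed performance standard."""
    return [x - standard for x in scores]


# Hypothetical weld-quality scores for five examinees; the standard is assumed.
welder_scores = [40, 45, 50, 55, 60]
STANDARD = 70

z_scores = norm_referenced(welder_scores)
gaps = content_referenced(welder_scores, STANDARD)
meets_standard = [gap >= 0 for gap in gaps]
```

In these data the highest scorer stands well above the group mean, yet
every examinee falls short of the standard, which is the interpretation
that matters when the question is whether the work is acceptable.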

In  short,  these  challenges  to  traditional  psychometric  theory, 
and  perhaps  others,  may  lead  to  a newer  and  firmer  foundation  for 
work sample testing. It is therefore useful to examine work sample
testing  in  context  in  the  gamut  of  psychological  measurement.  The 
purpose  of  this  examination  is  to  determine  whether  different  kinds 
of  measurement,  or  measurement  in  different  circumstances,  have 
different implications for the development and evaluation of measurement
procedures, particularly by work sample testing.

CONSIDERATIONS  FOR  A MEASUREMENT  TAXONOMY 

Psychological measurement does not occur as a disembodied abstraction.
It occurs in the context of a broader purpose than measurement
per se, and it occurs within a broader environmental context. Purpose
and setting, perhaps as much as the measurer's skill, determine what
is to be measured and how one may go about it. The purposes, settings,
variables, and techniques define a "gamut of psychological measurement"
much more extensive than is ordinarily considered. The principal purposes
of this report are (a) to suggest ways in which each of these
may be classified and (b) to suggest implications of these classifications
for the development and evaluation of specific approaches to
measurement. 

Personnel  testing  — indeed,  the  testing  movement  as  a whole  — 
occupies  a relatively  small  portion  of  the  total  field  of  psychological 
measurement. Work sample testing, even broadly defined, occupies a
correspondingly small place in the personnel testing domain. The
tunnel vision of overspecialized theorizing can and does permit
competent theory and practice in that branch of measurement traditionally
known as psychometrics, but test theory and practice can be
enriched  by  taking  cues  from  a broader  vision  of  measurement. 

The  implications  of  the  different  categories  can  sometimes  focus 
on  some  kinds  of  descriptions  of  appropriate  measurement,  descriptions 
that can be expressed as simple dichotomies. The introductory
remarks have emphasized one of these, the distinction between fundamental,
descriptive measurement internally interpretable and more
nearly  intuitive,  inferential  measurement.  Classical  psychometric 
theory  emphasizes  the  latter.  It  has  also  been  pointed  out  that 
another  possible  dichotomous  classification  distinguishes  norm- 
referenced  from  content-referenced  measurement;  classical  theory 
addresses  the  former.  Another  possible  dichotomy  distinguishes 
measures  of  maximum  performance  from  measures  of  typical  performance; 
classical  theory  addresses  both  so  long  as  performance  can  be  inferred 
normatively.

performance; examples include ratings, counts of production or other
achievements, output/input ratios of various kinds or records of production,
or personnel problems over a period of time.

All of these imply some sort of work sample for program evaluation.
The term is being used very broadly here to make the
point; proficiency ratings, for example, are typically assessments of
performance sampling a specified period of time, hence of a work
sample of sorts. The question in assessing performance in these
studies is not whether a sample of work is to be observed and evaluated
but rather how effectively the sample of performance can be
assessed. The first question in evaluating performance measurement
is whether the sample of performance observed in the experimental
setting is representative of performance in real or typical or targeted
circumstances. There are also basic questions of (a) whether
the performance is directly observed or only vaguely perceived (as
in supervisory ratings) and (b) whether the numbers representing
evaluations  of  performance  in  fact  reflect  irrelevant  attributes  of 
either  the  behavior,  the  worker,  or  the  observer. 

The  measurement  of  performance  in  the  experimental  situations 
typical of these studies is rarely concerned with individual differences.
The important unit of analysis is the group, not the individual,
and the typical measure of interest is the mean performance of
various experimental or control groups; "validity" is expressed as
the significance of differences between these mean levels of performance.
Occasionally the variance of subgroup performance will be
the  statistic  of  interest.  Very  rarely  is  the  individual  measure 
the  measurement  of  concern  in  these  experimental  circumstances. 
Individual  differences  are  usually  (although  improperly)  treated  as 
error variance. The reason, of course, is that the purpose of the
research is to make a decision about organizational practices or procedures,
not a decision about individuals.

Organizational  Trouble  Shooting.  A potential  but  not  well 
explored use of personnel measurement is for the diagnosis or identification
of organizational problems (Boyd, 1961). Measures of job
satisfaction  may  be  taken  in  different  aspects  of  an  organization  to 
try  to  identify  subgroups  who  may  be  pockets  of  discontent.  Job 
knowledge  tests  could  be  given  in  different  units  to  identify  similar 
pockets  of  ignorance.  Psychological  assessment  techniques  may  be 
used  to  identify  areas  of  inefficiency,  of  inappropriate  behavior, 
or  of  personnel  misclassification.  Most  such  studies  are  correlational 
in  nature;  such  studies  should  attempt  to  maximize  the  relevant 
variances  among  individuals,  somewhat  like  a magnifying  glass.  Other 
attempts  to  diagnose  organizational  problems  may  use  quasi-experimental 
designs;  in  these  studies  variance  within  groups  may  be  treated  as 
error to be minimized while seeking to maximize between-group differences.

Individual  Diagnosis.  The  term  diagnosis  is  not  restricted  to 
clinical  use.  In  many  personnel  testing  uses,  the  purpose  is  to 
identify  individual  strengths  and  weaknesses.  Sometimes  the  intent 
is  to  identify  a person's  own  relative  strengths  and  weaknesses, 
regardless  of  level.  In  other  cases,  one  asks  whether  one  individual 
measures  up  well  or  poorly  in  relation  to  others  or,  perhaps,  to 
some  standard  on  any  given  attribute.  These  are  inferential  measures; 
they  should  be  chosen  or  constructed  to  yield  the  most  acceptable  and 
useful descriptions of the attributes assessed with minimal contamination
from other attributes. A critically important issue in making
comparisons  is  whether  the  measurements  of  different  variables  or  from 
different samples can be expressed in a common metric.

Certification. A common purpose of measurement is to certify to
decision-makers that individuals have levels of attributes appropriate
to specific decisions. A high score on a licensing examination tells
the Board of Examiners that it can decide to certify to the public
that the person is competent or has certain knowledge essential to
competence. The Army system of skill qualification testing is another
example (Maier, Young, & Hirshfeld, 1976). Certification does
not necessarily indicate anything desirable; a clinical psychologist
may be required, for example, to certify to the court that a particular
person is incompetent to stand trial, or to participate in his
own defense, or some other form of incompetence. In personnel measurement,
certification usually is intended to assure decision-makers
that  certain  individuals  have  (or  do  not  have)  certain  qualifications 
necessary  for  effective  performance. 

Certification usually implies a dichotomous decision. An individual
will either be accepted for a job or for training or will not be
accepted;  measurement  can  likewise  be  reduced  to  a simple  dichotomy. 

It  should  not  be  believed,  however,  that  dichotomous  scoring  eliminates 
variance  among  people  chosen;  variance,  like  the  poor,  will  be  with 
us  always.  What  is  implied  is  that,  for  some  uses  of  measurement,  the 
amount  of  variance  within  a group  may  seem  trivial.  Measurement  for 
certification  may,  therefore,  be  considered  similar  to  measurement 
for  organizational  decisions  or  for  trouble  shooting;  the  problem  may 
be to minimize within-group variance and maximize between-group
variances. 

Prediction  of  Future  Status  Events  or  Performance.  All  of  the 
preceding  categories  logically  imply  a sort  of  prediction.  There  are, 
however, many purposes which may be explicitly stated in formal language
as predictive hypotheses.


Where prediction is the explicit purpose, two or more measurements
are involved: the measurement of the future variable (individual
status or performance or the occurrence of an event) and the
measurement of the predictor. The time element is an important part
of the predictive hypothesis, and the evaluation of measurement may
include an evaluation of the appropriateness of the elapsed time or
other circumstances under which the measurements are taken. Descriptive
measurements both at the time of prediction and the future time
need evaluation. Most important is the need to evaluate not only
the measurement but the tenability of the hypothesis itself.

There is nearly an infinite variety of things to predict in
personnel testing. One may wish to predict whether training will be
completed, level of proficiency at the conclusion of training, or
proficiency or other forms of behavior at some stabilizing period after
training has been completed. Each of these may call for slightly
different evaluations of measurement. If one attempts to measure
proficiency at the end of training, the measurement may seek to
assess maximum performance capability with reference to some standard.
Depending on the specific hypothesis, prediction of on-the-job proficiency
may require measurement of either typical or maximum performance.

Evaluation of Other Measurement. To complete the list of purposes,
it is necessary to point out that some personnel assessment is done
primarily in the validation of other measurement. It may serve as a
criterion measurement, as in prediction of future performance, or as
the measurement of a hypothesis tested in the evaluation of construct
validity.

1 
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TYPES  OF  MEASUREMENT  SETTINGS 


There is almost an infinite variety of situations in which
measurements are taken. Each category below could be subdivided, some of
them many times, with an increase in the precision with which settings
can be described. A relatively small number of categories is used,
however, because the important issue for personnel testing is the
degree to which measurement is representative of "real world"
situations. The categories chosen fall on a continuum ranging from
artificial but highly controlled to realistic but uncontrolled situations.
The higher the degree of control, the greater the loss of realism or
representativeness of the research and of the measurement in it.
Nevertheless, all measurement requires some degree of control or
there is no standardization of measurement.

Laboratory  Settings.  This  heading  describes  both  actual  labor- 
atories, where  full  control  of  extraneous  conditions  can  be  maintained, 
and  well-controlled  simulations.  Such  control,  in  personnel  testing, 
is  rare  except  in  experimental  studies  of  human  factors.  Measurement 
in  such  research  is  usually  concerned  with  the  evaluation  of  a compo- 
nent of  a system  rather  than  the  evaluation  of  a person  or  task  as 
such. Individual proficiency in a complex skill, however, may be
measured  in  laboratory-like  simulations  for  certification  purposes. 


The emphasis is on the level of control rather than on the physical
attributes of the setting. It is possible to have a highly
controlled experimental study under carefully selected field conditions.
Measurement of certain attributes, such as physiological processes,
may be done under conditions most nearly like those of laboratory
control regardless of the physical setting in which they occur. Even
within a laboratory setting, the level of control may vary; in the
study of reaction times, for example, a laboratory equipped with
modern electronic apparatus can achieve a higher level of control,
and therefore a greater degree of accuracy, than one where reactions
are timed with a stopwatch.

The control referred to in this discussion is not experimental
control over manipulations (a major characteristic of an experiment)
but control over the measurement process itself. Without such
control, attributes other than the one being measured (including
attributes of different objects) are permitted to influence the
measurement. With the highest levels of control, there is little
influence on the obtained measurement from extraneous sources. For
example, the electronic apparatus is more accurate in measuring
reaction time because it does not include error due to the speed of
reaction of the observer.
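
The stopwatch example can be illustrated with a small simulation. The
sketch below is only an illustration under assumed values: the reaction
times, the observer's mean lag of 0.15 second, and its spread are
hypothetical numbers chosen for the example, not data from any study.
It treats an electronically timed reading as the true time and a
stopwatch reading as the true time plus the observer's own reaction lag.

```python
import random

def stopwatch_readings(true_times, lag_mean, lag_sd, seed=0):
    """Add a random observer lag (hypothetical values) to each true time."""
    rng = random.Random(seed)
    return [t + rng.gauss(lag_mean, lag_sd) for t in true_times]

def rms_error(true_times, observed):
    """Root-mean-square difference between observed and true times."""
    squared = [(o - t) ** 2 for t, o in zip(true_times, observed)]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical true reaction times, in seconds.
true_times = [0.21, 0.25, 0.19, 0.30, 0.27]

# Electronic timing: no observer in the loop, so readings equal true times.
electronic = list(true_times)

# Stopwatch timing: the observer's own reaction speed enters the reading.
stopwatch = stopwatch_readings(true_times, lag_mean=0.15, lag_sd=0.05)

print("electronic RMS error:", rms_error(true_times, electronic))
print("stopwatch RMS error: ", rms_error(true_times, stopwatch))
```

The point of the sketch is only that an attribute of a different object
(the observer) contributes to the stopwatch readings; it says nothing
about whether the more accurate reading is the more useful one.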

More accurate measurement is not necessarily better measurement.
The basic problem in evaluating measurement under conditions of
laboratory control is the problem of generalizability. Does
measurement under the idealized, controlled conditions generalize to "real
world" uncontrolled conditions? The question is an empirical one,
and its importance varies with the opportunity for distortion in
measurement in either artificial or clearly uncontrolled situations.
What is at issue is the Brunswikian notion of representative design.
Measurement taken under the relatively sterile conditions of
laboratory settings may lack representativeness, and the laboratory may
therefore introduce its own error by influencing the behavior or
variable under study.

Settings of Institutional Control. This rather peculiar term is
intended as an umbrella term covering employment offices, clinics,
training centers, and other settings in which measurement is taken
under standardized (if not really controlled) conditions, conditions
which include the awareness of the subject being measured that
institutional decisions are going to be based on the results.
Standardization implies certain conventional concerns, such as consistency in
time limits, instructions, formats, etc. There are other concerns,
however, that have not been handled particularly well in the
psychometric literature. For example, are testing conditions standardized
when the same instructions are read to all people to be tested, or
when all of the people to be tested have been brought to some common
level of understanding? Answers to such questions may well determine
the success in minimizing unwanted influence on measurements.

Field  Settings.  Realistic  field  situations  can  be  described  on 
several  dimensions.  One  might  be  the  number  of  constraints  on  perform- 
ance imposed  by  the  environment;  in  some  environments  one  may  perform 
a wider  range  of  tasks,  or  perform  them  with  more  difference  in 
quality,  than  in  more  constraining  settings.  Some  settings  are 
supportive  and  facilitate  performance  of  the  measurement  task;  others 
are  hostile  environments  which  make  it  difficult  to  perform  well. 
Subjectively,  environments  fall  along  a continuum  ranging  from 
pleasant  to  unpleasant  settings,  or,  alternatively,  motivating  as 
opposed  to  inhibiting  conditions. 

The  purposes  of  measurement  imply  the  kinds  of  real-life  condi- 
tions to  which  the  results  are  expected  to  generalize,  and  they  also 
determine  whether,  under  those  conditions,  one  wants  to  infer  maximum 
or typical performance. It is obvious that some situations place a
limiting influence on performance; conditions of measurement may need
to include similar influences. Other consequences of the setting
include effects on performance standards or on what may be expected
as typical performance. Field settings, in short, provide numerous
sources of influence on obtained measurements. These influences
across settings may not be consistent from one individual to another;
scores obtained in different settings need to be compared for means,
variance, and correlations to determine whether inferences from scores
generalize from one setting to another.
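
The comparison of means, variances, and correlations across settings
can be sketched in a few lines. This is a minimal illustration, not a
prescribed procedure: the function names and the two small sets of
scores (the same five people measured in a laboratory-like setting and
in a field setting) are hypothetical and were made up for the example.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def compare_settings(scores_a, scores_b):
    """Summarize the same people's scores obtained in two settings."""
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "variance_a": variance(scores_a),  # sample variance (n - 1)
        "variance_b": variance(scores_b),
        "correlation": pearson_r(scores_a, scores_b),
    }

# Hypothetical scores for the same five people in two settings.
lab_scores = [12, 15, 11, 18, 14]
field_scores = [10, 14, 9, 15, 13]

summary = compare_settings(lab_scores, field_scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Similar means and variances together with a high correlation would
support generalizing inferences from one setting to the other; a low
correlation would suggest that the settings influence individuals
differently.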

TYPES  OF  VARIABLES:  ATTRIBUTES  OF  PEOPLE 


Many  kinds  of  variables  are  measured  in  psychological  research, 
including  attributes  of  organizational  and  physical  climates,  archi- 
tectural variables,  tangible  objects,  social  relationships  and  many 
other  stimuli  or  behavioral  outcomes.  For  convenience,  the  discussion 
here  will  be  restricted  to  attributes  of  people  and  to  attributes  of 
the  tasks  they  are  asked  to  do. 

The infinite variety of attributes of people has been organized
below  in  seven  categories.  The  categories,  which  certainly  are  not 
exhaustive,  seem  less  important  than  the  order  in  which  they  are  pre- 
sented. The  presentation  begins  with  a class  of  variables  most 
amenable  to  objective  measurement  and  concludes  with  variables  for 
which  little  or  no  objectivity  in  measurement  can  be  claimed. 

Objectivity in psychological measurement is an elusive concept.
It certainly should not be confused, as is commonly done, with a
multiple-choice format. The topic will be reexamined later. For the
present, modifying an earlier discussion (Guion, 1965), three
considerations may facilitate objectivity in measurement:

1. Objectivity is facilitated by responses which can be
empirically verified against some external standard as
opposed to qualitative or evaluative responses of
unverifiable substance.

2. Objectivity is facilitated by responses which are free or
unconstrained, where the respondent's own preferred
alternatives may be expressed, as opposed to responses which are
restricted or structured by the measurement process itself
(Thurstone, 1948).

3.  Objectivity  is  facilitated  by  responses  not  easily  or  likely 
to  be  distorted,  as  opposed  to  responses  distorted  by  delib- 
erate faking,  anxiety  about  the  purposes  of  the  testing,  etc. 

The common element in these is a matter of inference. Inferences
can be made with more confidence, and in fact are smaller inferences,
if  based  on  responses  that  can  be  declared  accurate,  or  are  free  from 
format  constraints,  or  are  not  distorted  in  other  ways.  On  the  other 
hand,  inferences  are  shaky  indeed  from  faked  reports  of  internal 
states  or  from  responses  which  fit  the  format  but  give  the  respondent 
no  option  for  the  response  that  would  be  a better,  more  accurate,  or 
more  honest  response. 

Physiological  Processes.  In  personnel  testing,  physiological 
variables  are  rarely  considered  except  in  human  factors  or  stress 
research. Nevertheless, it is instructive to consider how such
variables  can  be  measured.  Examples  might  include  such  diverse 
variables  as  respiratory  rate  or  capacity;  pulse,  blood  pressure,  or 
other  cardiovascular  measures;  metabolic  rates  or  chemical  concentra- 
tions; visual,  auditory,  or  cutaneous  acuity  or  sensitivity;  and 
others.  Measures  of  such  variables  are  often  fundamental  or  mathe- 
matically formally  derived  measurements.  They  may  be  measured  by 
counting  or  in  physical  units. 

It  is  important  to  be  clear  about  the  variable  being  measured  as 
distinguished  from  the  variable  that  might  be  inferred  from  the 
measurement.  If  we  are  concerned  about  the  effect  of  a program  of 
exercise on cardiovascular functioning because the purpose of the
program is to improve cardiovascular functioning, for example, we
measure such functions simply as variables to be interpreted on their
own  terms.  Frequently,  however,  we  may  be  interested  in  the  same 
measurements  as  a basis  for  other  kinds  of  inference.  For  example, 
research  on  reactions  to  stressful  environments  may  measure  the  same 
cardiovascular  functions  for  inferences  about  levels  of  anxiety,  a 
distinctly  different  type  of  variable. 

Motor Skills. This category, too, is concerned with biological
functioning; it differs in that its variables are peripheral, usually
directly observable behaviors; that is, the variables do not have to
be inferred from readings of instruments. Variables in this
category include dexterities, coordination, strength, and other
patterns of muscular behavior.

In personnel testing, these variables are most likely to be
measured as predictors in selection systems or for research on safety.
This category, and the preceding one, may on occasion be measured as
aspects of work sample performance; a work sample test for firefighters,
for example, may consist of timing the speed with which a candidate
can climb a ladder and return. An inference is involved, but it is
such an easy, direct one that it is not often questioned; it is
easy to infer skill in doing something but it is inadvisable to infer
a lack of skill from poor performance. The assumption is that one
cannot perform well without skill, but lack of skill is only one
of many reasons why one would perform poorly.

Performance Variables. This is an extremely broad category,
including most overt behavior. It includes, but is not limited to,
all measures of proficiency, speed or quality of performance,
evaluations of work products, ineffective or disruptive performance, or
certain kinds of performance habits or styles (approaches to carrying
out tasks). Such variables, whether defined in terms of maximum
or  of  typical  performance,  are  most  often  used  in  the  role  of  criteria 
or  dependent  variables.  They  may  also  be  used  as  predictors  or  as 
bases  for  instruments  certification  decisions.  Measures  of  attributes 
of  actual  behavior  may  be  the  basis  for  certification  of  proficiency 
or acceptability, or level of proficiency may be inferred from
measurements using indirect indicators. Work samples, in most cases, are
examples  of  performance  measures,  but  so  also  are  the  ubiquitous  rat- 
ings by  supervisors.  Performance  is  usually  an  objective  fact,  but 
it  does  not  necessarily  follow  that  its  attributes  can  be  easily  or 
objectively  measured. 

Job  Knowledge.  Closely  related  to  the  measurement  of  proficiency 
is  the  measurement  of  the  knowledge  required  to  become  proficient. 
Often,  although  sometimes  erroneously,  job  knowledge  tests  are  used 
for  drawing  inferences  of  proficiency.  This  use  of  job  knowledge 
variables  needs  to  be  recognized  as  an  example  of  a formal  hypothesis; 
that is, it is hypothesized that a measure of test proficiency is a
function  of  measured  job  knowledge.  The  hypothesis  may  often  be 
tenable,  particularly  in  highly  complex  jobs,  but  it  usually  deserves 
an  empirical  test. 

Of  the  categories  so  far  mentioned,  this  is  the  first  in  which 
conventional  principles  and  methods  of  test  construction,  following 
classical psychometric theory, are easily used. Psychometric
principles are rarely considered in the techniques for measuring
physiological processes. It is true that psychometric evaluations of reliability
and validity are commonly applied to measures of dexterity and
coordination, and they are frequently given lip service in measuring
aspects of performance. Nevertheless, this category is the first in
the list in which there are individual items that can be clearly
clustered into internally consistent dimensions, the kind of items
for which classical theoretical propositions, such as the
Spearman-Brown formula or the theoretical foundations for definitions of
parallel tests, were created.
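
For readers who want the Spearman-Brown formula in front of them, the
sketch below states it as a small function. The formula is the standard
classical result for the reliability of a test lengthened by a factor k
with parallel items, r_kk = k r / (1 + (k - 1) r); the function name
and the example values are chosen here for illustration only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test lengthened by `length_factor`.

    Classical Spearman-Brown prophecy formula:
        r_kk = k * r / (1 + (k - 1) * r)
    where r is the reliability of the original test and k is the ratio
    of new length to original length (added items assumed parallel).
    """
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Illustrative values: a test with reliability .50, doubled in length.
doubled = spearman_brown(0.5, 2)
print(round(doubled, 3))
```

The formula presupposes items that can be treated as parallel
components of one internally consistent dimension, which is why it fits
job knowledge items more naturally than, say, physiological readings.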

Cognitive  Variables.  The  history  of  mental  measurement  is 
largely  a history  of  the  measurement  of  cognitive  processes.  It  began 
with the measurement of intelligence (or "genius"), and much of its
progress  has  occurred  through  refinements  in  the  methods  of  measuring 
intellectual  functioning.  Intellectual  functioning  is  generally  con- 
sidered a form  of  information  processing,  the  principal  preoccupa- 
tion of  cognitive  psychology. 

Typically,  cognitive  variables  in  personnel  practice  are  measured 
through  the  use  of  paper-and-pencil  tests.  In  other  areas  of  psychol- 
ogy, there  is  evidence  of  discontent  with  this  form  of  measurement. 
Lunneborg  (1977)  reported  a series  of  three  studies  using  laboratory 
measures  of  reaction  time  correlated  with  standard  paper-and-pencil 
tests. The correlations were rather low, but the attempt to
understand conventional test performance in the language of cognitive
processes seemed intriguing. Cognitive variables are among the most
commonly used predictors in personnel selection and classification
programs; attempts to measure individual differences in these
variables that utilize cognitive theory and research should be watched
with interest.

Aspects  of  Personality  or  Temperament.  Attempts  to  measure 
characteristics  of  personality  have  been  highly  varied;  they  include 
personality  inventories,  projective  procedures  ranging  from  ink  blots 
to sentence completion forms, and procedures for inferring personality
characteristics from such objective data as suggestibility during an
experiment,  etc.  More  recently,  attempts  to  assess  such  variables 
through  assessment  center  exercises  have  become  quite  popular. 


These  variables  differ  from  those  under  all  previous  headings; 
they  are  less  a matter  of  what  a person  can  do  than  of  what  a person 
will  do.  The  emphasis  is  motivational,  and  it  has  no  objective 
referent.  Characteristics  of  personality  and  temperament  are  there- 
fore evaluated  against  normative  standards. 


Attitudes. The measurement of attitudes involves assessing
affective reactions to a wide variety of environmental characteristics.
Attitude scales may be developed by scaling checklist statements,
writing single item questions with graphic rating scales or other ad
hoc collections of intuitively scaled response options, or by using
the method of summated ratings on a series of such questions or
checklist statements. The most common example of attitude
measurement in personnel testing is the measurement of job satisfaction and
related reactions to work and work settings.

The level of blood sugar in a given sample of blood constitutes
an empirically verifiable fact; there is no way in which the level of
job satisfaction of an individual in a given setting can be considered
similarly verifiable. Moreover, the methods of measurement of
attitudes rarely permit free responses; the responses typically are
constrained by one of the formats mentioned above. Finally, as
people try to interpret the purposes of the measurement, or fear that
their responses can be identified and used against them, there is a
strong probability that responses will be consciously distorted. In
all respects, the measurement of attitude seems to be the least
objective of any of the variables in this list.


TYPES OF VARIABLES: ATTRIBUTES OF TASKS


The  long  history  of  measuring  attributes  of  people  has  made  it 
possible  to  organize  variables  describing  people  in  a fairly  coherent 
way. There is no comparable history in the measurement of task
characteristics, although the kinds of variables to be sampled in developing
work samples make task variables extremely important to the present
paper.  Very  briefly,  nine  categories  of  task  variables  can  be 
suggested.  An  attempt,  tentative  and  faltering,  has  been  made  to 
suggest  again  a rough  order  of  objectivity  or  verifiability,  but  no 
definition  of  objectivity  is  offered.  The  earlier  treatment  of 
objectivity  in  terms  of  responses  is  clearly  not  applicable. 


Duration  or  Intensity  of  Attention.  Some  tasks,  for  example  that 
of  the  air  traffic  controller,  require  a constant  and  unwavering 
vigilance  for  prolonged  periods.  Other  tasks  require  less  intense 
attention,  and  even  that  needs  to  be  maintained  for  only  brief  periods. 
Variables  might  differ  according  to  the  sensory  modalities  involved, 
the  focus  of  attention,  or  the  nature  and  costs  of  the  consequences 
of inattention. Some of these variables may relate more to cognitive
than to sensory processes, such as the number or complexity of details
that must be comprehended or manipulated, or the degree to which the
task demands attention to fact as opposed to attention to broad
generalization.


Hazards. Physical, social, or economic risks may be intrinsic
components of certain tasks. Such variables need to be considered
very carefully in the development of work sample measures; a work
sample designed to assess the performance of a police officer in
making an arrest may, for example, be severely distorted if the
sample involves simulated conditions in which the officer knows there
is no chance of being shot.

Degree of Task Structure. Perhaps one of the most widely studied
attributes of tasks is the degree of uncertainty (or its opposite,
structure). In some tasks the outcome of performance is highly
predictable. That is, one knows very clearly that doing the task in one way
leads surely to specified errors, whereas performance in a different
way leads to acceptable work products. In contrast, other tasks, such
as artistic or craft tasks, are often carried out with very little
assurance that the result will be the one intended.

Organizational Involvement. Some tasks can be done in nearly
total isolation. Other tasks require a worker to receive material or
ideas from other people and may also influence work of other people;
examples include assembly line activities, team activities, etc.
Organizational involvement may be a single variable which can be
measured in terms of the number of necessary interactions with other
people in an organization required to perform a task satisfactorily;
alternatively, it may be analyzed into component variables with
different organizational entities as the locus of involvement.

Task Complexity. Variables under this heading include the
level of knowledge and skill required to carry out the task, the
variety of skills demanded, the number or complexity of choices or
decisions that might have to be made, the level of accountability or
damages in the case of inadequate performance, or even the learning
time required to perform the task effectively. It is possible to
develop a work sample test using performance on relatively simple
tasks as a basis for inferences about performance on a more complex
task. Doing so implies, again, a hypothesized relationship between
performances on the simple and complex tasks, and that hypothesis
needs to be tested before its tenability is assumed.

Intrinsic Feedback. On some tasks, a worker can obtain
information about how well he is doing the task as he is doing it. One who
is cutting a piece of wood or metal on a lathe, for example, can
periodically check the dimensions against the specifications with
calipers and can evaluate his work. If one is using expendable tools,
such as saw blades, and one's rate of wear or breakage is excessive
relative to some standard, he can be aware of the flaw in performance
without being told by an independent observer or supervisor. In
other tasks, feedback about quality of performance is long delayed
and may sometimes be filtered through several processes; sometimes
it comes only from the subjective judgments of peers or supervisors.
Work sample testing appears to be more easily directed toward tasks
with opportunities for some intrinsic feedback.

One set of feedback variables may relate to the size of the
task unit. The amount of time or number of cycles required to
complete a unit of work, the frequency of interrupted tasks, the
opportunities to set goals, the tempo or pace of the work: all of these
influence the degree of feedback one gets in performing tasks.
(For a discussion of these variables, see Ryan & Smith, 1954.) Once
again, the importance of such variables in psychological measurement
by work sample tests is that work sample tasks should have feedback
properties similar to those of the work being sampled.

Skill Demands. This category includes motor, sensory, and
cognitive skills (and perhaps even attitudes) that are clearly
prerequisite to effective task performance. For some of these
variables the task may demand quite high levels; for other variables,
the level of ability demanded by the task may be much lower. These
variables have special implications for work sample testing to
whatever extent they change over time. Changes in the skill demands
of the job may correlate with, but should not be confused with, changes
in the skills of a person doing the job (Alvares & Hulin, 1973).
Some changes in skills applied in the performance of a job occur with
accumulated learning through experience; if this happens, the
advisability of work sample testing of inexperienced people should be
questioned.

Significance.  This  category  is  intended  to  include  any  variables 
which evaluate the importance of task outcomes. It may include the
importance of the task as an influence on the performance or
satisfaction of other people within the organization, it may be an element of
importance for society at large, or it may involve importance for
clients or customers of the organization.

Autonomy. Some tasks can be performed by the worker without
supervision or advice from other people; others must be done with close
supervision or consultation. Autonomy is the degree to which the
worker is free to do the task without the permission or advice of
someone else. Another kind of autonomy might be defined as the worker's
degree of discretion in making decisions; there may be different levels
of discretion for different kinds of decisions about the way tasks are
to be performed or the sequence to be followed in performing them.
Or, autonomy might be the number of tasks that can be completed, or
the period of time one may continue to work, without seeking
authorization. Or, it might be the level of the worker's control over such
things as pace, or sequence of activities, or quantity or quality
standards.


TYPES  OF  MEASUREMENT  METHODS 


The methods for measuring task attributes are related to those
for measuring the attributes of people; what differs is the nature
of the inference drawn. Although a variable such as the degree of
physical hazard may be determined by counting accidents, it is more
often assessed by someone's judgment or perception, a cognitive
process of the observer.

All  measurement  of  psychological  attributes  begins  with  the 
observation  of  the  responses  people  make  to  specific  stimulation. 
Differentiation  among  measurement  techniques  is  necessarily  based 
on  the  nature  of  the  observational  aids  used  and  on  the  manner  of 
recording  responses  and  transforming  them  into  measurement. 

Five categories are listed. Once again, these categories are
listed in the order in which they permit objectivity in measurement
or, conversely, in the reverse order of the magnitude of inferential
leaps necessary for the evaluation or interpretation of data. As
before, the categories follow this order as a matter of
convenience, not as a matter of invariance.


Instrumentation. Instrumentation as used here refers to
equipment, such as mechanical, electronic, or optical aids for
observation. People may respond to an emotional stimulus with an increase
in the moisture content of the skin surface. Except in the strongest
emotional states, however, these increases may be imperceptible
without the aid of galvanometers.


Many  physiological  responses  are  measured  on  standard  polygraph 
instruments.  Most  psychological  laboratories  boast  an  array  of 
solid  state  electronic  circuitry  for  the  measurement  of  reaction  time 
that would have seemed like science fiction to the psychologist
holding a stopwatch a mere quarter of a century ago. Sophistication in
research and sophistication in instrumentation have developed in
tandem.


Instrumentation is commonplace in measuring sensory capacities
or reaction times or choices. For the former, it is especially
helpful in the presentation of stimulus materials, while the latter
uses instrumentation to magnify, clarify, count, or record responses
or characteristics of responses. The instruments may be highly
sophisticated or quite simple. They may often be developed
specifically for particular measurement problems. For example, Gessewein &
Corrao (1971) developed special apparatus to study the possibilities
of leg fractures. Their purpose was to develop a family of curves
to provide designers with the means of predicting those conditions
under which Naval personnel on ships would be likely to receive
fractures; the variable to be measured was the force of impact as a
person fell from various heights, and the technique of measurement
was to have subjects drop stiff-legged onto a force gauge platform.

Instrumentation is often used in inferring work sample proficiency
through measuring characteristics of the work product. In a work
sample requiring the subject to make solder connections, for example,
the  quality  of  response  might  well  be  measured  by  measuring  the 
conductivity  of  the  solder  connections  themselves  rather  than  by 
measuring  responses  directly.  If  a piece  of  metal  is  to  be  machined 
to  specifications,  the  resulting  product  can  be  measured  with  any- 
thing from  a ruler  to  laser  beams  to  determine  whether  the  product 
is  within  tolerances. 
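
Scoring a machined work product against its specifications can be
sketched as follows. This is an illustrative sketch only: the dimension
names, nominal values, tolerances, and the choice to score the product
as the proportion of dimensions within tolerance are all assumptions
made for the example, not a scoring rule taken from this report.

```python
def within_tolerance(measured, nominal, tolerance):
    """True if the measured value lies within nominal +/- tolerance."""
    return abs(measured - nominal) <= tolerance

def score_work_product(measurements, specifications):
    """Proportion of specified dimensions that fall within tolerance.

    measurements:   {dimension name: measured value}
    specifications: {dimension name: (nominal value, tolerance)}
    """
    passed = [
        within_tolerance(measurements[name], nominal, tolerance)
        for name, (nominal, tolerance) in specifications.items()
    ]
    return sum(passed) / len(passed)

# Hypothetical specifications and one candidate's finished piece.
specifications = {"length_mm": (50.0, 0.1), "diameter_mm": (12.0, 0.05)}
measurements = {"length_mm": 50.04, "diameter_mm": 12.2}

score = score_work_product(measurements, specifications)
print(score)
```

Whatever instrument supplies the measured values, the score describes
the product; proficiency of the person who made it remains an
inference from that score.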

Direct  Observation  and  Recording.  This  category  is  best  illus- 
trated by  research  in  applied  behavioral  analysis  which  requires 
observers to count frequencies of specified behaviors. Just as
measurement techniques with instrumentation vary greatly in
sophistication, so also measurement by direct observation varies greatly in
the clarity, detail, and precision of instructions to observers and
in the precision with which their observations may be recorded. Under
many circumstances, some form of instrumentation may be a portion of
the  recording  process.  That  is,  the  observer  may  make  frequency 
counts  either  by  making  tally  marks  on  a piece  of  paper  or  by 
pressing  a button  activating  a counter. 

A less  exact  form  of  measurement  by  observation  is  used  in  many 
assessment  center  exercises.  The  observers  may  have  no  specific 
behaviors  to  count;  instead,  they  may  be  instructed  to  observe  and 
write  down  "any  salient  behavior."  At  the  conclusion  of  the  exercise, 
the  observer's  record  may  consist  both  of  such  narrative  descriptions 
and  an  evaluative  rating  of  the  behavior  observed. 

Records  and  Biographical  Data.  Many  variables,  of  which  atten- 
dance is  perhaps  the  best  example,  are  measured  by  frequency  counts 
obtained  not  by  direct  observation  but  by  examination  of  recorded 
data. Many kinds of records are maintained in most organizations.
If they are maintained consistently and accurately, they provide
useful data sources for the development of a variety of measures.
Therein, of course, lies the rub; most systems of personnel accounting
are notoriously poor. It is, however, possible to develop and
maintain effective ad hoc record systems for periods of perhaps
several months.

Measures  of  many  kinds  of  variables  may  be  derived  from  data 
maintained  in  records.  For  example,  records  may  contain  frequency 
counts of production and may also indicate periods of time away from
the principal assignment when a worker cannot be expected to be
productive. By combining the two sets of data, derived measures of
productivity  per  hour  or  per  day  can  be  developed.  If  situational 
factors  influence  daily  average  productivity,  records  can  be  organized 
so  that  distributions  of  productivity  in  different  situations  can  be 
determined  with  individual  production  records  standardized  in  terms 
of  those  distributions. 
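
The derivation just described might be sketched as follows. The record
layout, the worker and situation labels, and all numbers are
hypothetical; the sketch assumes only that production counts and time
away from the principal assignment are both available in the records.

```python
from statistics import mean, pstdev


def productivity_per_hour(units_produced, hours_on_shift, hours_away):
    """Units per hour of time actually available for production."""
    productive_hours = hours_on_shift - hours_away
    if productive_hours <= 0:
        raise ValueError("no productive time recorded")
    return units_produced / productive_hours


def standardize_within_situation(records):
    """Convert (worker, situation, rate) records to z-scores computed
    within the distribution of rates for the same situation."""
    by_situation = {}
    for _, situation, rate in records:
        by_situation.setdefault(situation, []).append(rate)
    z_scores = {}
    for worker, situation, rate in records:
        rates = by_situation[situation]
        sd = pstdev(rates)
        # With no spread in a situation, every record sits at the mean.
        z_scores[(worker, situation)] = (rate - mean(rates)) / sd if sd > 0 else 0.0
    return z_scores


# Hypothetical records: (worker, situation, units, hours on shift, hours away).
raw = [
    ("A", "day", 120, 8, 2),
    ("B", "day", 80, 8, 0),
    ("A", "night", 60, 8, 2),
    ("B", "night", 40, 8, 0),
]
rates = [(w, s, productivity_per_hour(u, on, away)) for w, s, u, on, away in raw]
z = standardize_within_situation(rates)
```

In this made-up case raw hourly rates are lower at night for both
workers, yet each worker holds the same standardized position in both
situations, which is the point of standardizing within situations.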

Records are kept in memory banks, be they file drawers, computers,
or human memories. If the memory bank is in a computer, it is simply
a form of storage. However, data stored in the memory of an individual
are often changed in "storage" and retrieval processes. Many
variables are measured by asking individuals to pull from the records
of their own memories information which can be scaled, counted, or
classified. It is in this context that the major difficulty in such
measurement comes into clear focus: the accuracy of records must
always be suspect. Records, whether from the memory of individuals
or from files, suffer from variations in carefulness, in organizational
procedures, in the interpretations of numbers, and in many other
ways that distort their accuracy.

Testing. Personal attributes of people are most often measured
by asking them questions and recording the answers to those questions;
this is certainly the most common measurement technique in personnel
research. Sometimes the questions are actually assignments ("Solve
this problem" or "Assemble that gadget"), but the prototype of this
form of measurement is the multiple-choice test item. The stimulus
material is the question asked or implied in the stem, and the
response is the choice of the option considered correct. If the
item has a genuinely correct answer, as in an arithmetic problem,
the correctness of response is highly verifiable and such tests are
usually called objective. There is less verifiability of the
correctness of the response when the question deals with the subject's
own typical behavior. A question might, for example, ask the subject
how he prefers to spend his spare time. The optional answers might
include responses such as reading a good book, going to an art
museum, attending a symphony orchestra concert, or watching situation
comedies on television. Many people will, of course, literally spend
more time watching situation comedies if for no reason other than the
ready availability of a television set; symphonies, art museums, and
good books may not be as accessible. The question, of course, does
not ask a factual question of how one's time is literally spent; it
asks how the subject likes to spend his time, and the response to
that question is not at all verifiable. Only the subject himself
knows his own preferences, and he may not be sure of them. Even if
he is sure, he may not be truthful. If he actually prefers situation
comedies over concerts, he may nevertheless respond that he would
prefer to go to a concert simply because in the testing situation he
perceives this to be a more socially desirable response. Since there
is no direct way to determine whether an individual has responded
honestly to the question, or even whether there is a clear-cut
answer, such testing is considered highly subjective.

Although the written multiple-choice question is a prototype,
it is by no means the only approach to measurement by question and
answer techniques. In determining how well an individual might be
able to detect salient stimuli in the midst of irrelevant but pervasive
stimulation, the question might be, "In which quadrant is the
target stimulus?" referring to a projection on a screen. Questions
in any form must be phrased appropriately. In the familiar Snellen
Eye Chart, for example, the "question" may be, "Can you read the
next line?" It is not appropriate for the subject to answer with a
yes or no; such flippancy can be avoided by simply assigning the
reading as a task: "Now read the next line."


Ratings.  When  all  else  fails,  or  when  energy  or  imagination  is 
lacking  to  suggest  anything  better,  psychological  measurement  consists 
of  ratings.  Some  form  of  rating  (or,  more  generally,  subjective 
evaluation) is the most commonly used method of measuring performance
and  related  behavioral  variables.  The  basic  rating  system  consists 
of  a format  for  recording  subjective  evaluations  of  designated  stimulus 
objects  or  items;  the  familiar  graphic  rating  scale  is  only  one  example. 

In fact, better examples involve both descriptions of observations
as a basis of evaluation and the evaluation itself. The observer
may note behaviors and either rate the behaviors along some designated
scale or consider them in rating the ratee on a pre-determined
dimension. Occasionally, the observations themselves form a rating scale.
Much research in developmental psychology or in animal research requires
observers to check one descriptive behavior statement observed among
a list of behavior statements that have been previously scaled.

Ratings are often not based on systematic observations. Periodic
efficiency reports or other methods of performance evaluation frequently
consist of ratings based on the vague impressions of superiors who
may  never  have  had  an  opportunity  to  observe  the  subordinate's  behavior 
directly.  Research  on  this  ubiquitous  use  of  ratings  casts  consider- 
able doubt  on  their  utility. 

Serious  question  may  also  be  directed  to  the  many  forms  of  self- 
rating used  in  psychological  measurement.  Many  personality  inventories 
of a question-and-answer form require the answer to be given in
terms of a scaled response. An item describing a particular form of
behavior might, for example, call for response options scaled in four
steps: "very much like me," "somewhat like me," "not very much like
me,"  "not  at  all  like  me."  This,  too,  is  a subjective  judgment  in 
which  the  response  requires  a rating  along  a scale.  Subjects  may  often 
be  given  simply  the  assigned  task  to  rate  themselves  on  specific 
dimensions  — again  with  the  rating  to  be  placed  on  a designated  form. 

The  objectivity  of  ratings,  or  their  verifiability,  depends  pri- 
marily on  the  nature  of  the  stimulus  material.  Subjective  ratings, 
or  discriminations,  are  called  for  in  any  psychophysical  measurement, 
such as an eye examination, yet these may be treated as relatively
objective. In contrast, an instruction to rate someone on "quality
of performance" is far too ambiguous to permit an interpretation of
objectivity. Moreover, the objectivity of ratings depends largely
on the raters' desire for objectivity; many forms of bias, ranging
from the self-protection of a central tendency response bias to
overt prejudice, may influence recorded ratings.

IMPLICATIONS OF THE CLASSIFICATIONS

The classification schemes described in the preceding section
might prove unwieldy or ambiguous if they were used to classify actual
studies; they have not been tried empirically. A desirable next step
would be to ask different expert judges independently to fit real
examples into the categories described. If specific uses can be
classified easily and reliably, support for the taxonomy would be
inferred; unreliability in classification would identify needs for
modification.

For  the  present  purposes,  however,  no  tightening  of  the  taxonomy 
is  necessary.  These  categories  may  not  be  optimal,  but  they  are  at 
least indicative; their implications for the construction and evaluation
of new testing programs will not differ substantially from those of
an empirically modified scheme.


In  this  section  of  the  report,  implications  will  be  considered 
first  for  each  of  the  different  classification  schemes;  they  will  then 
be  considered  for  combinations  of  classifications. 


IMPLICATIONS  OF  PURPOSES 


1.  For all purposes, measurement leads to decisions, and these
in turn at least imply some prediction of outcomes of the
decisions.

2.  Work samples may be relevant for any purpose, either as
dependent variables or as independent variables.

3.  No class of purposes imposes restrictions to particular
kinds of measurement. Although measurement of some aspect
of performance is commonly intended for many of these purposes,
it can be based either on fundamental descriptive
measurement or on measurement requiring greater inferential
leaps. Measurement in program evaluation for organizational
decisions, or measurement calling for the certification of
proficiencies, should in general need smaller or easier
inferences than do measurements for other purposes.

4.  The different purposes impose no special restrictions on
the kinds of variables to be assessed; both task variables
and person variables need to be assessed in meeting many of
these purposes.

5.  Measurement techniques which maximize variance may be used
for any of the types of purposes and are highly to be desired
for most.

6.  Measurements taken for decisions about groups (primarily in
evaluations of material, processes, or groups, but sometimes
in organizational troubleshooting) should provide significant
group differentiation. The principle may also apply
to certification (for example, to differentiate masters
from nonmasters), but only if the groups are very carefully
defined and if the basis for group membership is stable.
These two conditions may often be impossible to satisfy.

7.  Where the purpose is prediction, the evaluation of measurement
must be based on how well the predictor measure correlates
with a measure of the future event or state to be
predicted.

8.  For diagnostic or certification purposes, measurement should
be evaluated by logical or statistical relationships with
broader indices of proficiency or the diagnostic categories.
Such evaluations can be based on the logic of content sampling,
on correlations, or on experimental results.
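
The predictive evaluation named in item 7 can be illustrated with a
product-moment correlation between a predictor and a later criterion.
The function name, the scores, and the criterion values below are
hypothetical; the report does not prescribe a particular computation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Raises ZeroDivisionError if either variable has no variance.
    return sxy / sqrt(sxx * syy)


# Hypothetical data: work sample scores at entry and a later criterion rating.
predictor = [52, 61, 47, 70, 58, 66]
criterion = [3.1, 3.8, 2.9, 4.4, 3.2, 4.0]

validity_coefficient = pearson_r(predictor, criterion)
```

A coefficient of this kind speaks only to the predictive purpose; for
the diagnostic or certification purposes of item 8, the logic of
content sampling or experimental results may carry the evaluation
instead.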


IMPLICATIONS  OF  SETTINGS 


1.  The  purposes  of  measurement  define  the  set  of  conditions 
most  appropriate  to  that  measurement;  this  set  of  conditions 
might  be  termed  the  target  conditions.  In  any  setting  differ- 
ing from  the  target  conditions,  the  measurement  setting  should 
be  representative  of  the  target  situation  in  salient  respects. 

2.  Different settings may be responsible for different contaminating
variables in measurement; interpretations of the results
of measurement should consider the possible distortions introduced
by a particular setting.

3.  Where  the  measurement  situation  differs  significantly  from 
the  target  situation,  the  generalizability  of  inferences 
from  the  one  to  the  other  must  be  assessed. 

4.  Measurements  in  laboratory  settings  or  simulations  may  fail 
to generalize if they are over-controlled, that is, if
influences  expected  in  the  target  situation  are  not  permitted 
to  vary  in  the  laboratory. 

5.  Generalizability  of  measurement  in  institutional  settings  is 
less  concerned  with  the  generalizability  of  scores  than  with 
the  generalizability  to  attributes  of  greater  institutional 
concern;  usually,  this  form  of  generalizability  is 
expressed as predictability.

6.  When  measurement  is  done  in  naturalistic  or  field  settings, 
standardization  requires  that  specific  sets  of  conditions 
be  used.  The  problem  of  generalizability  is,  in  such  settings, 
one  of  generalizing  scores  (or  inferences  from  scores)  obtained 
in the standard setting to other relevant settings.


IMPLICATIONS OF PERSONAL VARIABLES


1.  The variables in the higher categories on this list are more
likely to be tangible or directly observable and less likely
to be abstract. Therefore, they can be measured more objectively,
and mathematically formal methods of measurement are
more likely to be available.

2.  The higher the category on this list, the less appropriate
is conventional norm-referenced measurement. One's pulse
rate after a period of extensive exercise is not evaluated
by its position in a normal distribution of pulse rates; it
is evaluated with reference to a standard given the age and
exercising condition of the individual whose pulse is
measured.

3.  Variables high in this list are likely to be evaluated
primarily in terms of accuracy; accuracy is an irrelevant
concern for variables low on the list. The notion of
accuracy implies a well-calibrated scale of measurement,
usually in units accepted by the scientific community.

4.  Work sample tests are most likely to be developed to measure
aspects of task performance, although in some components
and under some circumstances they may measure job knowledge
variables, motor skills, or physiological processes. Since
work sample testing measures variables in the higher categories,
these variables should be objectively measured,
interpretable with reference to a priori standards, and
capable of accurate measurement on a well-calibrated scale.

5.  The literal measurement of one variable (e.g., skin resistance
to current) may be chosen as a basis for inferences
about a different variable (in the example, it might be
anxiety). Such inferences imply hypotheses that need
empirical verification if the inferences are to be considered
valid.
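
The contrast drawn in item 2 between norm-referenced and
standard-referenced interpretation can be sketched as follows. The
pulse rates and the acceptable range are hypothetical and are not
clinical values; the function names are illustrative only.

```python
from statistics import mean, pstdev


def norm_referenced(score, group_scores):
    """z-score: position of a score within a reference group's distribution."""
    return (score - mean(group_scores)) / pstdev(group_scores)


def standard_referenced(score, lower, upper):
    """Classify a score against an a priori acceptable range [lower, upper]."""
    if score < lower:
        return "below standard"
    if score > upper:
        return "above standard"
    return "within standard"


# Hypothetical post-exercise pulse rates (beats per minute) for a group.
group = [110, 120, 130, 140, 150]

# The same reading interpreted two ways.
z = norm_referenced(150, group)
label = standard_referenced(150, lower=100, upper=145)
```

The norm-referenced result says only where the reading falls among
other people; the standard-referenced result says whether it is
acceptable, and would be unchanged if the reference group were
different.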


IMPLICATIONS  OF  TASK  VARIABLES 


1.  The  variables  higher  on  the  list,  in  general,  are  associated 
with  greater  opportunity  and  need  for  objective  measurement; 
they  should  be  interpretable  with  reference  to  previously 
established standards and accuracy of measurement.

2.  The identification of classes of task variables helps to
define the nature of a work sample test; a first-stage
(and sometimes sufficient) step in the evaluation of such
a test is to evaluate the degree to which it is congruent
with the work being sampled on salient classes of variables.

3.  Performance  variables  in  the  list  of  personal  variables  are 
likely  to  be  influenced  both  by  task  variables  and  by 
settings. 

4.  The overall nature of a task changes with changes in settings;
it follows that a major consideration in measurement of task
variables is the generalizability of scores or of inferences.
As a specific example, the task of cleaning a rifle in the
quiet of a barracks is quite different from the task of
cleaning the same rifle, with the same dirt, under fire. If
a task is to be properly sampled in a work sample, the conditions
of performance to be inferred should be specified.
Whether performance of the task under conditions other than
those specified will generalize to those conditions is an
empirical question.


IMPLICATIONS  OF  MEASUREMENT  METHODS 


1.  The  greater  the  precision  in  specifying  the  response  to  be 
observed,  the  less  the  ambiguity  and  the  greater  the  objec- 
tivity of  measurement.  Methods  higher  on  the  list  promote 
greater  specificity. 

2.  The  more  objective  or  fundamental  the  measurement  technique 
(for example, counting frequencies), the less the inference
required. Of course, one may use a fundamental measurement
for an intuitive inferential jump from it; such inferences
usually need empirical verification. In general, inferences
based  on  methods  high  on  the  list  are  more  easily  verified 
than  those  based  on  methods  low  on  the  list. 

3.  Regardless  of  measurement  technique,  some  form  of  reliability 
information  is  essential  to  measurement.  That  reliability 
may  be  the  consistency  assured  by  well-calibrated  instru- 
ments, or  the  agreement  of  independent  observers,  or  the 
internal  consistency  of  scaled  responses  to  a set  of  atti- 
tude items.  Whatever  the  form  of  reliability  of  greatest 
concern, no measurement technique can be evaluated more
highly than the reliability permits. Reliability is rarely
a sufficient evaluation, even though it is a necessary one.
A set of ratings may be highly reliable because of the
presence of constant errors, but the reliability is of
very little value if it means no more than consistently
false inferences.

4.  Objectivity may be illusory. The presence of sophisticated
instrumentation  is  not  an  assurance  of  objective  measurement. 
The  question  must  be  asked  whether  the  measurement  obtained 
with  such  instrumentation  is  fundamental  measurement,  that 
is,  measurement  to  be  interpreted  in  terms  of  its  own  units, 
or  whether  it  is  a basis  for  a derived  inference. 
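
The warning in item 3, that ratings can be highly reliable because of
constant errors, can be made concrete with a small sketch. It assumes,
purely for illustration, that "true" performance levels are known and
that two raters share the same leniency error; all values are
hypothetical.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Product-moment correlation using population moments."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


# Hypothetical "true" performance levels for five ratees.
true_levels = [2, 3, 4, 5, 6]

# Both raters inflate every rating by the same two points (a constant error).
rater_1 = [t + 2 for t in true_levels]
rater_2 = [t + 2 for t in true_levels]

# Agreement between raters is perfect ...
interrater_r = correlation(rater_1, rater_2)

# ... yet every rating overstates the level it is taken to describe.
mean_error = mean([r - t for r, t in zip(rater_1, true_levels)])
```

The reliability index cannot reveal the shared error; only comparison
with something outside the ratings themselves can do that.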


SIMULTANEOUS IMPLICATIONS OF VARIABLES AND METHODS


Special  implications  for  the  evaluation  of  measurement  can  come 
from  a simultaneous  consideration  of  the  kinds  of  personal  variables 
being  measured  and  the  method  of  measurement.  In  abbreviated  form, 
condensing  the  classification  of  person  attributes  to  five  categories, 
the  two  classifications  are  shown  in  matrix  form  in  Figure  1.  The 
matrix  is  so  arranged  that  the  upper  left-hand  corner  represents  the 
maximum possibilities for objective measurement and the lower
right-hand corner represents the maximum in necessary subjectivity.

In  the  extreme  cases,  measurement  of  physiological  or  psychomotor 
attributes  with  special  measuring  instruments  requires  only  accuracy  in 
the  calibration  of  the  measuring  instruments;  with  accuracy,  questions 
of  reliability  are  moot.  Concern  for  the  generalizability  of  measures 
obtained  from  the  situation  of  actual  measurement  to  targeted  situa- 
tions is,  of  course,  always  a consideration  in  the  evaluation  of  any 
measurement,  but  so  far  as  the  variables  and  methods  are  concerned, 
the  closer  the  situation  to  the  upper  left  of  Figure  1,  the  more 
salient  the  concept  of  accuracy  is  to  the  evaluation  of  measurement. 
Accurate measurements are those that are most readily verifiable with
reference to some standard unit of measurement such as an inch, a
gram, or a count.


In  the  other  extreme  is  measurement  using  some  form  of  rating 
scale  for  the  assessment  of  attitudes.  There  is  no  way  in  which  the 
"accuracy"  of  such  measurement  can  be  verified.  It  is  possible  to 
obtain  indices  of  consistency  of  response,  but  there  is  no  way  to 
determine whether the attitude is correctly or accurately measured.
Not only are there no standard units of measurement, but there is no
external referent that can be clearly said to be a better or more
nearly precise statement of attitude; there is no Bureau of Standards
for attitude measurement. Not even behavioral observations can be
used as criteria for validating a measure of attitude; too many learned
variables influence the expression or inhibition of behavior appropriate
to the attitude. In a taste preference study, for example, one must
simply  take  the  subject's  word  for  it  that  he  evaluates  one  stimulus 
higher  than  the  other.  Thus  the  first  kind  of  implication  for  this 
matrix  is  its  influence  on  the  permissible  precision  of  measurement. 

The  above  comments  demonstrate  an  interdependence  of  the  nature 
of  the  variable  being  measured  and  the  method  of  measurement.  Both 
the  nature  of  the  variable  and  the  nature  of  the  technique  influence 
the  saliency  of  different  considerations  in  the  evaluation  of  measure- 
ment. 


Reliability.  Beyond  generalizability,  which  is  universally 
necessary, the various cells in Figure 1 identify up to four kinds of
essential evaluations for particular combinations. Cells marked with
an A are those in which the first step in evaluation is an inquiry
into reliability. The first step in evaluating reliability is not a
computation of a reliability coefficient but an examination of the
technique of measurement itself: is the method of measurement
appropriately standardized? Beyond that, the question of reliability
encompasses all of the familiar concerns of equivalence, stability, and,
above  all,  internal  consistency. 

In a sense, every cell in the matrix should include an A since
reliability is the sine qua non of effective measurement. The cells
of the upper left-hand corner, however, will have satisfied the needs
for reliability automatically if the measurement can be shown to be
accurate. Since accuracy has been identified as the principal
consideration for this set of combinations, and since unreliable measures
cannot be very accurate, the evaluation of reliability is superfluous
if accuracy is established. In all other cells, reliability
often must be established as a basis for, or at least a consideration
in, any other evaluative determination. Where special instruments
are used, reliability may refer primarily to technical fallibility
(such as trouble from poor electrical contact). Where measurement
uses observers, the consistency or agreement among observers is the
essential reliability. In some forms of physical or behavioral
observation, the observing and recording responses may be easy enough
that little or no observer error is possible or likely, and it may in
such cases be unnecessary to become greatly concerned about reliability.
Where observers are rating knowledge or cognition or attitudes, they
are exercising their own judgments and, therefore, the likelihood of
fallibility in measurement because of differences in observer judgment
is very real and must be investigated. Reliability in measurement
by testing is well-established in classical psychometric theory,
as it is in scaling and other forms of rating. Reliability in record
keeping is probably derivable from psychometric reliability; the
consistency of record keeping, as well as the consistency of inferences,
may be best determined by dividing records into small units of time
and comparing the data collected in different time periods. The
various considerations needed for estimates of reliability will be
reconsidered in the discussion of generalizability.
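
One way to carry out the comparison of records across time periods is
sketched below. The odd-day and even-day split, the helper names, and
the production counts are all assumptions made for illustration; other
divisions of the record would serve equally well.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_by_period(daily_counts):
    """Total a worker's daily counts separately for odd- and even-numbered days."""
    return sum(daily_counts[0::2]), sum(daily_counts[1::2])


# Hypothetical daily production counts for four workers over six days.
records = {
    "A": [10, 11, 9, 10, 12, 11],
    "B": [14, 15, 15, 13, 14, 16],
    "C": [7, 8, 6, 7, 8, 7],
    "D": [12, 12, 11, 13, 12, 12],
}

odd_totals, even_totals = zip(*(split_by_period(c) for c in records.values()))

# Consistency of the record: do the two sets of periods order workers alike?
consistency = pearson_r(odd_totals, even_totals)
```

A high value indicates only that the records are consistent from period
to period; it does not show that they were kept accurately.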

Reliability, it must be emphasized, is necessary in all measurement.
It does not follow from that fact that reliability coefficients
must always be computed. Where there is evidence of accurate measurement,
it is also evidence of reliability, because there is no accuracy
without reliability. Likewise, where there is evidence of validity
(discussed below as "acceptability of inferences"), it is also evidence
of reliability, because there is no validity without reliability.
The important thing is to build the measuring instrument with care to
insure maximum reliability.

A notation  (A)  in  Figure  1 denotes  particular  uncertainty  about 
effective  ways  to  estimate  reliability. 

Logical Acceptability. Once reliability is established, the next
evaluation concerns the acceptability of the operational definition,
shown as B in Figure 1. This is largely a matter of precision in
measurement; if measurement is fundamental in nature, following formal
mathematical axioms, acceptance is highly probable. Statistically
derived or intuitive measurements may also, however, be widely accepted,
simply on the basis of the way in which the measurements are collected,
if their logical foundation is persuasive enough. One issue in
determining logical acceptability is whether the measurement fits its
purposes in relation to the distinction between maximum and typical
performance. If the purpose of measurement is to find out what people
actually do in real situations, a highly controlled estimate of
maximum performance cannot be accepted on logical grounds, whereas a
less sophisticated form of measurement obtained under more realistic
conditions, i.e., more representative conditions, may be readily
accepted.

The  greater  the  objectivity  in  measurement,  the  greater  the 
likelihood  of  its  logical  acceptability.  Objectivity,  it  should  be 
noted,  is  clearly  distinguishable  from  construct  validity,  despite 
points  of  similarity.  As  defined  in  this  report,  objectivity  depends 
on  the  degree  to  which  the  response  itself  is  free  from  distortion, 
whereas  construct  validity  refers  to  the  degree  to  which  the  interpre- 
tations from  the  response  are  free  from  distortion  by  influences 
unrelated  to  a designated  construct.  Probably  the  greater  the  objec- 
tivity, the  greater  the  construct  validity,  but  the  question  really 
does  not  arise.  What  does  arise  is  the  question  of  whether  the  response 
is  a clearly  identifiable,  interpretable,  unambiguous  response  as 
opposed  to  the  degree  to  which  it  is  undefined  and  subject  to  varying 
interpretations. An inference, even from some physiological measurement,
may lack construct validity even when variables are accurately
measured.  In  medical  diagnosis,  for  example,  physicians  may  find 
symptoms  easily  measurable  but  difficult  to  interpret  diagnostically. 


Under  certain  circumstances,  characteristics  of  distributions  of 
measurements  may  be  considered  in  evaluating  the  logical  acceptability 
of  measurement.  As  just  one  example,  one  may  ask  whether  the  measure- 
ment involves  ceiling  effects  such  that  descriptions  of  individuals 
high  on  a given  attribute  are  inaccurately  obtained  because  of  the 


validity is of both practical and technical importance. It is practically
important because it facilitates judgments of logical acceptability,
at least in the middle set of cells in Figure 1. It is technically
important because examinees or observers may be more appropriately
motivated by measures that "look right," thus adding to the
objectivity of measurement.

Acceptability  of  Inferences.  Another  set  of  questions  refers  to 
the  acceptability  of  inferences  extending  beyond  the  obvious  content. 

In  the  conventional  way  of  talking  about  psychometric  validity,  most  of 
the  preceding  discussion  on  logical  acceptability  referred  to  so-called 
content  validity.  Questions  of  the  acceptability  of  inferences  are, 
in  contrast,  questions  of  construct  or  of  criterion-related  validity. 
Ihe  cells  marked  C in  Figure  1 are  those  where  attributes  can  be 
satisfactorily  inferred  from  the  measurement  only  on  the  basis  of 
supporting  empirical  evidence.  In  any  specific  case,  if  the  nature  of 
the  measurement  is  inference  rather  than  fundamental  description,  the 
psychometric  concepts  of  the  validity  of  the  inferences  are  the  most 
important  aspects  of  evaluation.  Even  if  the  measurement  ostensibly 
measures at a more fundamental level, inferential jumps from that
level must be validated. The example given earlier should be remembered:
when one uses a physiological measure not as a description of
physiological  functioning  but  as  a manifestation  of  anxiety,  the 
inference  to  be  validated  is  the  use  of  the  measurement  as  an  index 
of  anxiety.  The  accuracy  of  measuring  the  physiological  process  is 
irrelevant. Wherever the measurement is intended to lead to an
inference  of  attributes  outside  of  its  literal  content,  evidence 
of  some  form  of  validity,  specifically  criterion-related  or  construct 
validity,  is  essential. 


The crux of classical psychometric validity is the extent to
which the variance in measurements is attributable only to the construct
intended to be inferred. Insufficient validity, therefore,
means that part of the variance in a set of scores is classically
seen (a) as being attributable to sources of variation other than the
one intended or (b) as irrelevant to the variable to be predicted.
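
The classical view just described can be illustrated with a small
sketch. It assumes, purely for illustration, that observed-score variance
is the sum of three uncorrelated components: variance due to the intended
construct, variance due to other systematic sources, and random error.
The function name and the numerical values are hypothetical and do not
come from this report.

```python
def variance_shares(var_construct, var_other, var_error):
    """Return each component's share of total observed-score variance.

    Assumes the three components are additive and uncorrelated, so the
    total variance is simply their sum (an illustrative simplification).
    """
    total = var_construct + var_other + var_error
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        "construct": var_construct / total,
        "other_sources": var_other / total,
        "error": var_error / total,
    }


# Hypothetical variance components for a set of test scores.
shares = variance_shares(var_construct=60.0, var_other=25.0, var_error=15.0)
for source, share in shares.items():
    print(f"{source}: {share:.2f}")
```

Under these assumptions, the share attributable to the construct is the
portion of variance that supports the intended inference; the remainder
is what the classical view treats as a threat to validity.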

Standard-Based Interpretations. The letter D appears in Figure 1
wherever the obtained measure should be interpretable with reference
to a standard. (Some arguable cells are identified with the D in
parentheses.) These are generally conditions permitting substantial
objectivity  and  in  which  the  accuracy  of  measurement  can  be  assessed. 
Usually,  they  are  examples  where  fundamental  or  mathematically  formal 
measurement  is  plausible. 

In  a sense,  this  could  apply  to  all  of  the  cells;  arbitrary 
standards or cutting scores could be established. Cognitive test
scores,  for  example,  can  be  interpreted  as  deviations  from  such 
arbitrary  points. 

The  intent  of  the  designation  in  Figure  1,  however,  is  somewhat 
different;  it  is  intended  to  refer  to  standards  defined  in  terms  of 
the measurement scale, not the distribution of measurements. The intent
here  is  not  so  much  permissive  as  suggestive.  Wherever  the  purpose 
of  measurement  is  certification  or  institutional  decision-making, 
the  aim  of  test  specialists  should  be  to  provide  measurement  that 
can  be  interpreted  with  reference  to  real  performance  standards. 
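
The distinction between the two kinds of interpretation can be sketched
as follows. This is a minimal illustration, not a procedure from this
report: the function names, the group scores, and the standard of 95 are
all hypothetical. The first function expresses a score in standard
deviation units of a group's distribution; the second expresses it in
units of the measurement scale relative to an a priori standard.

```python
from statistics import mean, pstdev


def distribution_referenced(score, group_scores):
    """Express a score in standard deviation units of the group distribution."""
    spread = pstdev(group_scores)
    if spread == 0:
        raise ValueError("group scores have no variability")
    return (score - mean(group_scores)) / spread


def standard_referenced(score, standard):
    """Express a score relative to an a priori standard on the measurement scale.

    Returns the deviation in scale units and whether the standard is met.
    """
    return score - standard, score >= standard


# Hypothetical scores on a performance measure and a hypothetical standard.
group = [50, 60, 70, 80, 90]
best_score = 90
required_standard = 95

print(distribution_referenced(best_score, group))       # relative standing in the group
print(standard_referenced(best_score, required_standard))  # standing against the standard
```

In this hypothetical case the highest scorer stands well above the group
mean and still falls short of the standard, which is the kind of
information a distribution-based interpretation alone cannot supply.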

SUMMARY 

Heuristic  classifications  of  the  purposes  and  circumstances  of 
psychological  measurement,  of  the  variables  to  be  measured,  and  of  the 
techniques available for such measurement have been presented. Two
major  conclusions  should  be  drawn.  First,  conventional  psychological 
testing  is  contained  in  only  a relatively  small  portion  of  all  of  the 
classes  of  psychological  measurement.  A single-minded  devotion  to  the 
principles  and  theory  of  classical  psychometrics  has  many  values,  but 
it  also  has  the  severe  disadvantage  of  ignoring  the  values  of  other 
approaches  to  measurement.  Other  approaches  may  be  more  useful  where 
accurate  descriptions  rather  than  abstract  inferences  are  sought;  even 
testing  for  inferential  purposes  can  be  improved  if  the  methods  of 
obtaining  the  underlying  descriptions  are  more  objective  and  accurate. 

Second,  classical  psychometric  theory  may  be  too  narrow  to  use  in 
the evaluation of measurement in some of the classes. Evaluation of
measurement  may  include  reliability  and  validity  estimation,  to  be 
sure,  but  it  should  also  include  a logical  evaluation  of  measuring 
techniques  as  operational  definitions  of  variables,  and  it  should  seek 
more frequent application of the usual scientific practice of interpreting
measures with reference to a priori standards.

The classifications, and the broad conclusions reached from considering
them, apply to work sample testing. Work samples may be used
for any of the purposes of measurement, although in these reports
they are primarily considered for certification purposes. Whether
the product or the process of producing it is scored, work samples fit
in any kind of setting; again, however, the interest of this report
is primarily in settings of institutional control. With reference to
Figure 1, work samples are most likely to be tests of performance,
although they may include any of the classes of variables represented
by the top three rows or the classes of methods in the three columns
on the left.

The common requirements for the evaluation of measurement in
those nine cells are (a) assessing the logical acceptability of the
measurement and (b) determining whether scores can be interpreted with
reference to a standard. Neither of these kinds of evaluation invokes
classical concepts of validity, although evidence of validity may provide
further argument in the logic supporting a measure as an operational
definition of the variables measured. Moreover, conventional
validity is probably necessary for job knowledge or for some performance
variables if these are assessed by direct observation instead of
through tests or physical instrumentation. In short, despite the
fact that conventional validities may provide useful information,
inferences of attributes beyond the obvious content of the work sample
itself are often conspicuously absent from work sample testing and,
for these cases, conventional statements of validity may be superfluous
and even misleading.

This is not meant to imply that criterion-related or construct
validation of inferences from work sample performance is necessarily
inappropriate. The point being stressed here is that the evaluation
of work sample measurement is not fundamentally an evaluation of its
use in the measurement of an inferred construct or of its power to
predict some external behavior; rather, a work sample is evaluated
primarily on its acceptability as a direct description of the performance
of interest. The demands of this kind of evaluation need careful
explication.